Driver analysis for the candy-power-ranking dataset

Overview and objectives

The "candy power ranking" dataset (provided by FiveThirtyEight, distributed under the CC0 Public Domain license, and available on GitHub) contains 85 candy brands. Included are nine binary attributes (e.g., whether the candy contains chocolate, caramel, or nougat), the sugar and unit-price percentiles, and the win rate (as a percentage) from pairwise matchups.

The participants' task was to choose their preferred candy in a pairwise comparison. While the authors briefly explained how participants were selected, no further details about the data collection are known. For the calculations, it is assumed that the design was balanced, i.e., that each candy was shown an equal number of times and paired against each other candy an equal number of times.

The objective of this analysis is twofold. The first goal is to identify the characteristics that matter most for the preferred sweets. The second is to turn those findings into a recommendation for action.

Description of the variables:
(taken from the official GitHub README.md)

Header Description
chocolate Does it contain chocolate?
fruity Is it fruit flavored?
caramel Is there caramel in the candy?
peanutyalmondy Does it contain peanuts, peanut butter or almonds?
nougat Does it contain nougat?
crispedricewafer Does it contain crisped rice, wafers, or a cookie component?
hard Is it a hard candy?
bar Is it a candy bar?
pluribus Is it one of many candies in a bag or box?
sugarpercent The percentile of sugar it falls under within the data set.
pricepercent The unit price percentile compared to the rest of the set.
winpercent The overall win percentage according to 269,000 matchups.



Procedure



Data import

The dataset was imported and is valid. There are 85 brands and no missing values in the features. All features are numeric except competitorname (which should be a string or object in this case). As already seen, winpercent is formatted as a raw percentage on a 0-100 scale. To work on the same feature basis in the train and test splits, winpercent should also be divided by 100 to match the other percentage columns.
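The rescaling step can be sketched as follows (using a hypothetical three-row miniature of the dataset; the column names follow the real file):

```python
import pandas as pd

# Hypothetical miniature of the candy dataset; the real file is
# candy-data.csv from the FiveThirtyEight repository.
df = pd.DataFrame({
    "competitorname": ["A", "B", "C"],
    "chocolate": [1, 0, 1],
    "sugarpercent": [0.73, 0.22, 0.60],
    "pricepercent": [0.86, 0.12, 0.51],
    "winpercent": [66.97, 45.47, 52.34],  # raw 0-100 scale
})

# Bring winpercent onto the same 0-1 scale as the other percentage columns
df["winpercent"] = df["winpercent"] / 100.0

print(df["winpercent"].between(0, 1).all())  # True
```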

Building train and test split

Before starting to work on the model, a train/test split is created.
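A minimal sketch of the split, assuming scikit-learn's train_test_split and hypothetical stand-in data (the actual feature frame holds all candy attributes and the rescaled winpercent):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in features and target
X = pd.DataFrame({"chocolate": [1, 0, 1, 0, 1, 0, 1, 0],
                  "sugarpercent": [0.7, 0.2, 0.6, 0.1, 0.9, 0.3, 0.5, 0.4]})
y = pd.Series([0.67, 0.45, 0.52, 0.38, 0.71, 0.41, 0.58, 0.44])

# Hold out a test set before any modelling; random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(len(X_train), len(X_test))  # 6 2
```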

Overview

Let's start by getting a feel for the data itself, in particular the distributions.

At first glance, winpercent looks close to normally distributed. However, because of the binning used for the plot, we should double-check.

Many statistical tests require a normally distributed dependent variable.

Visually, winpercent tends to be slightly skewed.
Nevertheless, better safe than sorry: let's check it with statistical tests.

Shapiro-Wilk, D'Agostino's K² and Anderson-Darling all fail to reject the null hypothesis (the p-values are greater than .05, and the Anderson-Darling statistic is below its critical value). It is therefore safe to assume a normally distributed dependent variable, and parametric models can be used for higher test power.
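The three tests are available in scipy.stats; a sketch with a synthetic stand-in for the winpercent column (the original cells are not shown):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for winpercent: 85 draws from a normal distribution
sample = rng.normal(loc=0.5, scale=0.15, size=85)

# Shapiro-Wilk and D'Agostino's K^2 return (statistic, p-value);
# p > .05 means we cannot reject normality
_, p_shapiro = stats.shapiro(sample)
_, p_k2 = stats.normaltest(sample)

# Anderson-Darling compares its statistic against critical values instead;
# index 2 corresponds to the 5% significance level
ad = stats.anderson(sample, dist="norm")

print(p_shapiro > 0.05, p_k2 > 0.05, ad.statistic < ad.critical_values[2])
```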

Let's get a feel for the relationship between each feature and the dependent variable, and also among the features themselves.

With the exception of caramel, all features appear to be related to winpercent, with chocolate appearing to be by far the best predictor at the moment.

Before proceeding, we should check whether the features are correlated with each other. This information matters for interpreting the regression results and evaluating feature importance.

To check for multicollinearity, we can use the VIF score:

If the features are unrelated, the VIF coefficient should be 1.
In our case it is between 1 and 5 (with the highest values for bar, chocolate, fruity & nougat), so they are moderately related.
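A VIF check can be sketched with statsmodels' variance_inflation_factor (here on synthetic stand-in features, where "bar" is deliberately made to correlate with "chocolate" to mimic the moderate multicollinearity):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
# Synthetic binary/continuous features with induced correlation
chocolate = rng.integers(0, 2, size=85)
bar = np.where(rng.random(85) < 0.7, chocolate, rng.integers(0, 2, size=85))
X = pd.DataFrame({"chocolate": chocolate,
                  "bar": bar,
                  "sugarpercent": rng.random(85)})

# statsmodels expects a design matrix that includes an intercept column
X_design = X.assign(const=1.0)
vif = pd.Series(
    [variance_inflation_factor(X_design.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif.round(2))
```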

Nevertheless, we should take this fact into account when interpreting the meanings of the characteristics.
Another option is to use different methods to assess importance, such as Shapley values (as computed via dominance analysis) or relative weights (for a great overview, please refer to Nathans, Oswald & Nimon, 2012).

But first things first: data cleaning and finding the optimal model.

Data cleaning & further preprocessing

The winpercent feature has already been transformed to the same range as the other two percentage features. No column has missing values (as indicated by DataFrame.info). The scales are binary and continuous and already exist as numerical values.
Feature conversion (such as one-hot encoding) is therefore not required.

Feature scaling is most likely not required, for two reasons:


Yet, just to be on the safe side, we should review the general descriptive statistics:

Visually, there is not much difference between the variables. Let's double check this with a statistical test:

Statistically there is also no big difference, so the features are left in their original range.

Normally, an outlier analysis would follow. There are several reasons to skip it here:

Formulating the model

The idea is to use cross-validation to account for the small sample size.
This is combined with a grid search over different learners to find the best base learner for the model.

It is probably also better to use simpler learners (fewer parameters to tune) to reduce the risk of overfitting during training.
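The cross-validated comparison of learners can be sketched like this (with synthetic stand-in data of the same shape, 85 rows by 11 features; the learner list and hyperparameters are illustrative, not the author's exact grid):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, OrthogonalMatchingPursuit
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
# Synthetic stand-in data for illustration
X = rng.random((85, 11))
y = X @ rng.random(11) + rng.normal(scale=0.5, size=85)

# Compare a few simple learners via cross-validated R^2
learners = {
    "linreg": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "omp": OrthogonalMatchingPursuit(n_nonzero_coefs=3),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(est, X, y, cv=cv, scoring="r2").mean()
          for name, est in learners.items()}
print(pd.Series(scores).sort_values(ascending=False))
```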

We can see already three things:

So let's re-rank by sorting on the "delta" and dropping models with a test score below .2.
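The re-ranking step can be sketched with hypothetical scores (the learner names and numbers here are made up for illustration):

```python
import pandas as pd

# Hypothetical mean train/test CV scores per learner
res = pd.DataFrame({
    "train": [0.70, 0.45, 0.55, 0.30],
    "test":  [0.15, 0.38, 0.25, 0.10],
}, index=["linreg", "omp", "ridge", "tree"])

# Drop weak models (test < .2), then sort by the train-test gap ("delta")
res["delta"] = res["train"] - res["test"]
ranked = res[res["test"] >= 0.2].sort_values("delta")
print(ranked.index[0])  # omp has the smallest gap among the survivors
```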

Interesting:
The second place (OrthogonalMatchingPursuit) comes out first this time. Let's plot it:

Indeed, OrthogonalMatchingPursuit looks like the "best" learner compared to the other six.
Although the model does not deliver dream scores, it is by far the best among the available ones.

But why don't we just take the model with the best training values (as the graph below shows for the linear regression models)?

The problem with such an approach is that we would choose a model that only pretends to be good. In fact, the linear regression scores only about 0.15 on unseen data (while the training score suggests a model fit of roughly 0.7).
This implies that a linear regression model is unlikely to provide any explanatory power. The best strategy in this case is to rely on the model with the highest test score (among those available) and estimate the feature importance on it.

Let's proceed with the OrthogonalMatchingPursuit algorithm and check the final score on the hold-out data:

So, in the end, the cross-validation score did not decrease compared to the hold-out score (.38 vs. .37). Although the value is still not great, the small deviation between the two scores indicates a relatively robust model.
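The hold-out check itself has this shape (sketched with synthetic stand-in data, since the original cells are not shown; the scores above come from the real dataset):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic stand-in: only the first two features carry signal
X = rng.random((85, 11))
y = X @ np.array([0.5, 0.3] + [0.0] * 9) + rng.normal(scale=0.1, size=85)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on the training split, score on the untouched hold-out split
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X_tr, y_tr)
print(round(omp.score(X_te, y_te), 2))
```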

Let's take a look at the regression weights to get a first impression of the importance of each variable:

As we can see, the most important aspect (by far) when choosing a candy is chocolate. Nevertheless, let's try to derive a proper feature ranking.
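Inspecting the fitted weights can be done by pairing the coefficients with the column names (again with synthetic stand-in data where chocolate is made dominant, mirroring the finding above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(4)
features = ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
            "crispedricewafer", "hard", "bar", "pluribus",
            "sugarpercent", "pricepercent"]
X = rng.random((85, len(features)))
y = 0.6 * X[:, 0] + rng.normal(scale=0.1, size=85)  # chocolate dominates

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
# Rank coefficients by absolute magnitude
weights = pd.Series(omp.coef_, index=features).sort_values(
    key=abs, ascending=False)
print(weights.head(3))
```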

Predictor importance

The idea is to calculate Shapley values as a proxy for importance. To do this, a dominance analysis is performed.
Dominance analysis is computationally expensive: it calculates the explained variance for every possible feature combination and extracts the unique importance of each feature. Therefore, we need to run

$$2^p-1$$

calculations (where p equals the number of features). In our case:

$$2^{11}-1 = 2047 $$

The curse of the relatively small dataset is at the same time a blessing, as it allows us to perform all calculations quite quickly.
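The all-subsets idea can be sketched directly: fit an OLS model on every non-empty feature subset and average each feature's marginal gain in R² with the Shapley weighting. A minimal sketch with p = 4 synthetic features (the analysis itself uses p = 11, i.e., 2047 fits):

```python
import itertools
import math

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
p = 4  # small feature count for illustration
X = rng.random((85, p))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=85)

# R^2 of an OLS fit on each non-empty feature subset (2^p - 1 fits)
r2 = {}
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        model = LinearRegression().fit(X[:, subset], y)
        r2[subset] = model.score(X[:, subset], y)
assert len(r2) == 2 ** p - 1

def shapley(j):
    """Shapley value of feature j: weighted average marginal gain in R^2."""
    total = 0.0
    for subset, score in r2.items():
        if j in subset:
            rest = tuple(s for s in subset if s != j)
            base = r2[rest] if rest else 0.0
            k = len(rest)
            weight = (math.factorial(k) * math.factorial(p - k - 1)
                      / math.factorial(p))
            total += weight * (score - base)
    return total

vals = [shapley(j) for j in range(p)]
print(int(np.argmax(vals)))  # feature 0 carries most of the signal
```

A useful sanity check: the Shapley values sum exactly to the R² of the full model (the efficiency property).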

Results

As has been suggested many times (e.g., by correlations, etc.), chocolate is by far the most important predictor for candy success.
But we are not done yet. Let's check some more things.
One of the side questions raised is which type of candy to focus on: cookies or gummies. We should also take into consideration which feature drives the price the most.

Further Analysis

We will first analyze the preferred candy type.
Unfortunately, there is no explicit indicator of group/type membership in the dataset.
But perhaps we can derive this information by splitting the dataset on the characteristics chocolate and fruity:
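The split can be sketched with a small hypothetical frame (the type labels are a tentative assumption derived from the two flags, not part of the dataset):

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's structure
df = pd.DataFrame({
    "competitorname": ["A", "B", "C", "D"],
    "chocolate": [1, 0, 1, 0],
    "fruity": [0, 1, 1, 0],
    "winpercent": [0.67, 0.45, 0.52, 0.40],
})

def candy_type(row):
    """Tentative type assignment from the chocolate/fruity flags."""
    if row["chocolate"] and not row["fruity"]:
        return "chocolate"
    if row["fruity"] and not row["chocolate"]:
        return "fruity"
    if row["chocolate"] and row["fruity"]:
        return "blend"
    return "other"

df["type"] = df.apply(candy_type, axis=1)
print(df.groupby("type")["winpercent"].mean())
```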

If I had to guess, the chocolate-based candies are cookies, the fruit-based ones are gummies, and the rest is something else. Let's search the net quickly and see if we are correct.

Well, actually, these candies are more like: chocolate bars vs. gummies vs. blends vs. others (like gummies that are not fruity).
(Also, not every picture matches the competitor name perfectly, e.g. "one dime" and "Daim".)
Unfortunately, only two of these four types have a reasonable (and comparatively equal) sample size. So let's check which candy type, chocolate bar vs. gummies, is generally preferred by consumers (if I had to guess again, I would bet on the chocolate bar).

The chocolate bar candy type has a higher win rate, by about 20 pp. on average.
This result is consistent with the importance analysis: chocolate-based products are preferred by consumers, so the chocolate bar should also be the preferred candy type.
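A quick statistical check of such a group difference could use Welch's t-test (sketched with synthetic win rates for the two groups; the ~20 pp. gap is taken from the analysis above, the numbers below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Hypothetical win rates: chocolate bars vs. gummies
bars = rng.normal(loc=0.60, scale=0.10, size=20)
gummies = rng.normal(loc=0.40, scale=0.10, size=20)

# Welch's t-test does not assume equal variances
t, p = stats.ttest_ind(bars, gummies, equal_var=False)
print(t > 0, p < 0.05)
```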
Last but not least, we should check which product characteristic drives the price of the products the most:

According to the analysis, the biggest drivers of the price are bar, chocolate & sugarpercent. In short, chocolate bar candies appear to be more expensive than the other candies. Yet, they are preferred more often.
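One simple way to rank price drivers is by correlation with pricepercent (sketched here with synthetic data in which bar and chocolate are constructed to drive the price, mirroring the finding above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 85
df = pd.DataFrame({
    "bar": rng.integers(0, 2, n),
    "chocolate": rng.integers(0, 2, n),
    "sugarpercent": rng.random(n),
})
# Synthetic price driven mainly by bar and chocolate, plus noise
df["pricepercent"] = (0.3 * df["bar"] + 0.25 * df["chocolate"]
                      + 0.1 * df["sugarpercent"]
                      + rng.normal(0, 0.05, n))

# Rank the features by their correlation with the price
corr = df.corr()["pricepercent"].drop("pricepercent").sort_values(
    ascending=False)
print(corr.round(2))
```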


Summary

Summing up all findings:


However, there are still open questions that matter for a recommendation:
The production cost of the candy is not known. It is possible that chocolate-based products are less profitable due to higher manufacturing costs.

In addition, other situational questions remain unresolved, for example consumption occasions. If gummy bears are consumed on far more occasions (even though they are not the preferred product), the total sales volume could still be higher for the bears.
Also, the data is still very limited. It is likely that a better model could be found based on more data alone.

The following steps are therefore recommended: